Does Hugging Face Datasets Support Efficient Referencing of Images to Avoid Duplication?

pp010123 · June 1, 2025, 6:05am

I’m currently building a dataset using Hugging Face Datasets. Each image in my dataset has approximately 15 different annotations. My dataset is structured such that each data item corresponds to exactly one annotation (i.e., the i-th annotation can be accessed directly with dataset[i]). Thus, if I have one image with 15 annotations, it results in a dataset of 15 items, each containing the same image paired with a different annotation.

This approach currently causes inefficient storage usage because the same image is stored repeatedly.

I prefer to avoid external URLs or external storage solutions and would like to:

Store each image exactly once within an Arrow file or similar bundled format.
Store annotations with internal references to these images.
Dynamically load and pair the image from this internal reference with the corresponding annotation upon accessing dataset[i].

I have this specific constraint because I am interested in understanding whether this approach can be achieved without altering my existing framework. I understand that using external URLs or storage would simplify the problem significantly, but I’d like to avoid that if possible.

Does Hugging Face Datasets support this internal referencing mechanism to efficiently manage images without redundant duplication or external downloads? If so, could anyone provide guidance or examples on how to implement this approach effectively?

John6666 · June 1, 2025, 8:47am

If your main concern is preventing duplication, it may be better to store each image once and refer to it by URL or file name, though this may not be very convenient for large datasets. @lhoestq

pp010123 · June 1, 2025, 4:22pm

I see, thank you. Since my use case is quite specific, it seems I will need to implement it myself.
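
For anyone landing on this thread with the same problem, here is a minimal sketch of what "implementing it myself" could look like. It is not a built-in feature of the library: it assumes the data is split into two tables, one row per unique image and one row per annotation, joined by a hypothetical `image_id` column. The class name `PairedView` and the column names are made up for illustration. Plain lists of dicts stand in for the two tables so the example runs on its own.

```python
from typing import Any, Dict, Sequence


class PairedView:
    """Read-only view that joins annotation rows to a table of unique images."""

    def __init__(
        self,
        images: Sequence[Dict[str, Any]],
        annotations: Sequence[Dict[str, Any]],
    ) -> None:
        self.images = images
        self.annotations = annotations
        # Map each image_id to its row position so each lookup is a dict access.
        self._row_of = {images[i]["image_id"]: i for i in range(len(images))}

    def __len__(self) -> int:
        # One item per annotation, matching the layout described in the question.
        return len(self.annotations)

    def __getitem__(self, i: int) -> Dict[str, Any]:
        item = dict(self.annotations[i])  # copy so the stored row is not mutated
        image_row = self.images[self._row_of[item["image_id"]]]
        item["image"] = image_row["image"]  # attach the image stored only once
        return item


# Stand-in tables: two unique images, three annotations (hypothetical data).
images = [
    {"image_id": "img0", "image": b"<bytes of image 0>"},
    {"image_id": "img1", "image": b"<bytes of image 1>"},
]
annotations = [
    {"image_id": "img0", "label": "cat"},
    {"image_id": "img0", "label": "sitting"},
    {"image_id": "img1", "label": "dog"},
]

dataset = PairedView(images, annotations)
print(len(dataset), dataset[1]["label"], dataset[1]["image_id"])
```

Since a `datasets.Dataset` supports `len()` and integer indexing that returns a dict, two `Dataset` objects (for example, two Arrow-backed tables saved side by side) could likely be passed in place of the lists. Note that building the index as written reads every image row once. A similar lookup could also be attached to the annotation table with `Dataset.with_transform`, which applies a function on access, though I have not tested that for this use case.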
